A Florida health insurance company wants to predict annual claims for individual clients. The company pulls a random sample of 100 customers. The owner wishes to charge an actuarially fair premium to ensure a normal rate of return. The owner collects all current customers' health care expenses from the last year and compares them with what is known about each customer's plan.
The data on the 100 customers in the sample is as follows:
Answer the following questions using complete sentences and attach all output, plots, etc. within this report.
insurance <- read.csv("../CodingAssignment03/Insurance_Data_Group9.csv")
Randomly select 30 observations from the sample and exclude them from all modeling. Provide the summary statistics (min, max, std, mean, median) of the quantitative variables for the remaining 70 observations.
set.seed(123457)
exclude <- sample(nrow(insurance), 30)
train <- insurance[-exclude, ]
test <- insurance[exclude, ]
train %>%
tbl_summary(statistic = list(all_continuous() ~ c("{mean}", # Mean
"{sd}", # Standard Deviation
"{median}", # Median
"{min}", # Minimum
"{max}" # Maximum
)
),
type = all_continuous() ~ "continuous2" # Enhanced formatting for continuous variables
)
| Characteristic | N = 70¹ |
|---|---|
| Charges | |
| Mean | 13,375 |
| SD | 12,237 |
| Median | 9,570 |
| Min | 1,136 |
| Max | 51,195 |
| Age | |
| Mean | 41 |
| SD | 14 |
| Median | 44 |
| Min | 18 |
| Max | 64 |
| BMI | |
| Mean | 31.1 |
| SD | 5.7 |
| Median | 30.8 |
| Min | 16.0 |
| Max | 47.7 |
| Female | 28 (40%) |
| Children | |
| 0 | 27 (39%) |
| 1 | 15 (21%) |
| 2 | 16 (23%) |
| 3 | 9 (13%) |
| 4 | 2 (2.9%) |
| 5 | 1 (1.4%) |
| Smoker | 16 (23%) |
| WinterSprings | 17 (24%) |
| WinterPark | 23 (33%) |
| Oviedo | 14 (20%) |
| ¹ n (%) | |
Provide the correlation between all quantitative variables
cor(train[, c("Charges", "Age", "BMI", "Children")])
## Charges Age BMI Children
## Charges 1.0000000 0.2669529 0.2394231 0.2497586
## Age 0.2669529 1.0000000 0.2372201 0.2987257
## BMI 0.2394231 0.2372201 1.0000000 0.1781634
## Children 0.2497586 0.2987257 0.1781634 1.0000000
Run a regression that includes all independent variables in the data table. Does the model above violate any of the Gauss-Markov assumptions? If so, what are they and what is the solution for correcting?
allvar <- lm(Charges ~ Age + BMI + Female + Children + Smoker + WinterPark + WinterSprings + Oviedo, data = train)
summary(allvar)
##
## Call:
## lm(formula = Charges ~ Age + BMI + Female + Children + Smoker +
## WinterPark + WinterSprings + Oviedo, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11292.8 -2542.3 18.9 3010.9 18364.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15510.23 3872.49 -4.005 0.000171 ***
## Age 236.62 47.33 4.999 5.15e-06 ***
## BMI 457.04 117.63 3.886 0.000254 ***
## Female -1722.18 1286.24 -1.339 0.185562
## Children 1401.17 546.30 2.565 0.012799 *
## Smoker 23987.60 1550.21 15.474 < 2e-16 ***
## WinterPark -3948.67 1775.32 -2.224 0.029846 *
## WinterSprings -382.03 1950.68 -0.196 0.845382
## Oviedo -309.50 1913.52 -0.162 0.872042
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5074 on 61 degrees of freedom
## Multiple R-squared: 0.848, Adjusted R-squared: 0.8281
## F-statistic: 42.54 on 8 and 61 DF, p-value: < 2.2e-16
plot(allvar)
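Beyond the diagnostic plots, two quick checks can make the Gauss-Markov discussion concrete. This is a sketch, assuming the `car` and `lmtest` packages are installed (the report already uses `car` for `scatterplotMatrix`):

```r
library(car)    # for vif()
library(lmtest) # for bptest()

vif(allvar)     # variance inflation factors; values well above 5-10 flag multicollinearity
bptest(allvar)  # Breusch-Pagan test; a small p-value indicates heteroskedasticity
```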
Applying a log transformation to Charges reduces heteroskedasticity and compresses the spread of the data. After the transformation, the histogram indicates that the log of Charges is approximately normally distributed. Therefore, the log of Charges is a better representation of this variable.
Implement the solutions from question 3, such as data transformation, along with any other changes you wish. Use the sample data and run a new regression. How have the fit measures changed? How have the signs and significance of the coefficients changed?
hist(train$Charges) #before
train$lnCharges <- log(train$Charges)
hist(train$lnCharges) #after
hist(train$Charges) #before
train$ChargesSquared <- train$Charges^2
hist(train$ChargesSquared) #after
scatterplotMatrix(train[c(10,2:3,5)]) # grabbing ln charges
par(mfrow=c(1,2)) # Input to place the charts side by side
train$ChildrenSquared <- train$Children^2
hist(train$ChildrenSquared) #after
train$lnChildren <- log(train$Children) # note: log(0) = -Inf for customers with no children
hist(train$lnChildren) #after
scatterplotMatrix(train[c(10,2:3,12)]) # grabbing lnCharges with lnChildren
par(mfrow=c(1,2)) # Input to place the charts side by side
hist(train$Age) #before
train$AgeSquared <- train$Age^2
hist(train$AgeSquared) #after
hist(train$Age) #before
train$lnAge <- log(train$Age)
hist(train$lnAge) #after
hist(train$BMI) #before
train$BMISquared <- train$BMI^2
hist(train$BMISquared) #after
hist(train$BMI) #before
train$lnBMI <- log(train$BMI)
hist(train$lnBMI) #after
Transforming the Children variable into log and quadratic form does not yield a normal distribution (and the log is undefined for customers with zero children). Therefore, the Children variable does not fit the log or quadratic form.
#Model 1
model_1 <- lm(lnCharges ~., data = train[,c(10,2:9)] )
summary(model_1)
##
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 2:9)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7258 -0.1434 0.0042 0.1503 1.4266
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.572798 0.255273 25.748 <2e-16 ***
## Age 0.035493 0.003120 11.376 <2e-16 ***
## BMI 0.020083 0.007754 2.590 0.0120 *
## Female 0.121214 0.084788 1.430 0.1579
## Children 0.088441 0.036012 2.456 0.0169 *
## Smoker 1.656425 0.102189 16.209 <2e-16 ***
## WinterSprings 0.007457 0.128588 0.058 0.9539
## WinterPark -0.126200 0.117029 -1.078 0.2851
## Oviedo -0.028646 0.126139 -0.227 0.8211
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3345 on 61 degrees of freedom
## Multiple R-squared: 0.8794, Adjusted R-squared: 0.8636
## F-statistic: 55.6 on 8 and 61 DF, p-value: < 2.2e-16
# The only change in this model is the log of Charges; all independent variables remain the same.
par(mfrow=c(2,2))
plot(model_1)
#Model 2
model_2 <- lm(lnCharges ~., data = train[,c(10,3:9,15)] ) #pulling only columns I want
summary(model_2)
##
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 3:9, 15)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55977 -0.14355 0.02305 0.10310 1.37177
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.188562 0.404219 7.888 6.85e-11 ***
## BMI 0.019762 0.007223 2.736 0.00814 **
## Female 0.109889 0.079076 1.390 0.16968
## Children 0.072828 0.033830 2.153 0.03530 *
## Smoker 1.631300 0.095059 17.161 < 2e-16 ***
## WinterSprings 0.002113 0.119861 0.018 0.98599
## WinterPark -0.130848 0.109018 -1.200 0.23468
## Oviedo -0.040374 0.117440 -0.344 0.73219
## lnAge 1.339429 0.106512 12.575 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3118 on 61 degrees of freedom
## Multiple R-squared: 0.8952, Adjusted R-squared: 0.8815
## F-statistic: 65.13 on 8 and 61 DF, p-value: < 2.2e-16
#this model uses lncharges and lnAge
par(mfrow=c(2,2))
plot(model_2)
# Model 3
model_3 <- lm(lnCharges ~., data = train[,c(10,14,3:9)] ) #pulling only columns I want
summary(model_3)
##
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 14, 3:9)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.87652 -0.16920 0.02645 0.16491 1.41726
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.189e+00 2.696e-01 26.663 < 2e-16 ***
## AgeSquared 4.115e-04 4.172e-05 9.862 3.02e-14 ***
## BMI 2.127e-02 8.500e-03 2.503 0.01501 *
## Female 1.304e-01 9.298e-02 1.402 0.16598
## Children 1.065e-01 3.919e-02 2.718 0.00854 **
## Smoker 1.663e+00 1.122e-01 14.823 < 2e-16 ***
## WinterSprings 6.428e-03 1.410e-01 0.046 0.96380
## WinterPark -1.267e-01 1.284e-01 -0.987 0.32778
## Oviedo -2.027e-02 1.386e-01 -0.146 0.88417
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3669 on 61 degrees of freedom
## Multiple R-squared: 0.8549, Adjusted R-squared: 0.8359
## F-statistic: 44.92 on 8 and 61 DF, p-value: < 2.2e-16
#this model uses lncharges and AgeSquared
par(mfrow=c(2,2)) #residuals vs fitted
plot(model_3) #q-q residuals
#Model 4
model_4 <- lm(lnCharges ~., data = train[,c(10,2,4:9,17)] ) #pulling only columns I want
summary(model_4)
##
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 2, 4:9, 17)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.73920 -0.13134 0.00264 0.13730 1.41835
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.101250 0.784230 6.505 1.63e-08 ***
## Age 0.035289 0.003141 11.236 < 2e-16 ***
## Female 0.118779 0.084788 1.401 0.1663
## Children 0.089731 0.035970 2.495 0.0153 *
## Smoker 1.655747 0.102238 16.195 < 2e-16 ***
## WinterSprings 0.017437 0.128540 0.136 0.8925
## WinterPark -0.120176 0.116477 -1.032 0.3063
## Oviedo -0.029406 0.126237 -0.233 0.8166
## lnBMI 0.613866 0.238249 2.577 0.0124 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3346 on 61 degrees of freedom
## Multiple R-squared: 0.8793, Adjusted R-squared: 0.8634
## F-statistic: 55.53 on 8 and 61 DF, p-value: < 2.2e-16
#this model uses lncharges and lnBMI
par(mfrow=c(2,2)) #residuals vs fitted
plot(model_4) #Q-Q Residuals
#Model 5
model_5 <- lm(lnCharges ~., data = train[,c(10,2,4:9,16)] ) #pulling only columns I want
summary(model_5)
##
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 2, 4:9, 16)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.70989 -0.16308 0.00825 0.14457 1.43884
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.886e+00 1.816e-01 37.926 <2e-16 ***
## Age 3.573e-02 3.106e-03 11.503 <2e-16 ***
## Female 1.227e-01 8.492e-02 1.445 0.1536
## Children 8.775e-02 3.610e-02 2.431 0.0180 *
## Smoker 1.656e+00 1.023e-01 16.188 <2e-16 ***
## WinterSprings 6.694e-05 1.289e-01 0.001 0.9996
## WinterPark -1.288e-01 1.176e-01 -1.096 0.2776
## Oviedo -2.688e-02 1.262e-01 -0.213 0.8321
## BMISquared 3.049e-04 1.191e-04 2.561 0.0129 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3349 on 61 degrees of freedom
## Multiple R-squared: 0.8791, Adjusted R-squared: 0.8633
## F-statistic: 55.46 on 8 and 61 DF, p-value: < 2.2e-16
#this model uses lncharges and BMISquared
par(mfrow=c(2,2))
plot(model_5)
#Model 6
model_6 <- lm(lnCharges ~., data = train[,c(10,2:4,6:9,12,5)] ) #pulling only columns I want
summary(model_6)
##
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 2:4, 6:9, 12,
## 5)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.70026 -0.14405 -0.00955 0.13481 1.37468
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.620186 0.259924 25.470 < 2e-16 ***
## Age 0.035077 0.003150 11.136 3.11e-16 ***
## BMI 0.018474 0.007929 2.330 0.0232 *
## Female 0.132595 0.085614 1.549 0.1267
## Smoker 1.650975 0.102378 16.126 < 2e-16 ***
## WinterSprings -0.007636 0.129557 -0.059 0.9532
## WinterPark -0.137528 0.117643 -1.169 0.2470
## Oviedo -0.055094 0.129051 -0.427 0.6710
## ChildrenSquared -0.024082 0.024630 -0.978 0.3321
## Children 0.171712 0.092472 1.857 0.0682 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3346 on 60 degrees of freedom
## Multiple R-squared: 0.8813, Adjusted R-squared: 0.8635
## F-statistic: 49.49 on 9 and 60 DF, p-value: < 2.2e-16
#this model uses lncharges and ChildrenSquared
par(mfrow=c(2,2))
plot(model_6)
#make children a dummy variable
train$childdummy <- 0
train$childdummy[train$Children > 0] <- 1
#Model 7
model_7 <- lm(lnCharges ~., data = train[,c(10,2:4,6:9,18)] ) #pulling only columns I want
summary(model_7)
##
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 2:4, 6:9, 18)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.65227 -0.13093 -0.01566 0.13835 1.37804
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.642231 0.251498 26.411 < 2e-16 ***
## Age 0.035427 0.003043 11.644 < 2e-16 ***
## BMI 0.016919 0.007792 2.171 0.03380 *
## Female 0.133797 0.083224 1.608 0.11307
## Smoker 1.663931 0.100444 16.566 < 2e-16 ***
## WinterSprings -0.023576 0.121822 -0.194 0.84719
## WinterPark -0.161530 0.112590 -1.435 0.15649
## Oviedo -0.068542 0.123236 -0.556 0.58012
## childdummy 0.263374 0.090093 2.923 0.00485 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3284 on 61 degrees of freedom
## Multiple R-squared: 0.8838, Adjusted R-squared: 0.8685
## F-statistic: 57.97 on 8 and 61 DF, p-value: < 2.2e-16
#this model uses lncharges and dummy children
scatterplotMatrix(train[c(10,2:3,18)]) # grabbing lnCharges
par(mfrow=c(1,2)) # Input to place the charts side by side
par(mfrow=c(2,2))
plot(model_7)
Use the 30 withheld observations and calculate the performance measures for your best two models. Which is the better model? (remember that “better” depends on whether your outlook is short or long run)
test$lnCharges <- log(test$Charges)
test$childdummy <- 0
test$childdummy[test$Children > 0] <- 1
test$insurance_model_pred <- predict(allvar, newdata = test)
test$model_1_pred <- predict(model_1,newdata = test) %>% exp()
test$model_7_pred <- predict(model_7,newdata = test) %>% exp()
# Finding the error
test$error_bm <- test$insurance_model_pred - test$Charges
test$error_1 <- test$model_1_pred - test$Charges
test$error_7 <- test$model_7_pred - test$Charges
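The errors above can be summarized into standard performance measures. The sketch below computes RMSE and MAPE for the three sets of predictions; the helper names `rmse` and `mape` are illustrative, not from the original code:

```r
# Helper functions for out-of-sample fit (illustrative names)
rmse <- function(e) sqrt(mean(e^2))
mape <- function(e, actual) mean(abs(e / actual)) * 100

data.frame(
  Model = c("Full model (levels)", "Model 1 (lnCharges)", "Model 7 (child dummy)"),
  RMSE  = c(rmse(test$error_bm), rmse(test$error_1), rmse(test$error_7)),
  MAPE  = c(mape(test$error_bm, test$Charges),
            mape(test$error_1, test$Charges),
            mape(test$error_7, test$Charges))
)
```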
Provide interpretations of the coefficients; do the signs make sense? Perform marginal change analysis (thing 2) on the independent variables.
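In a log-linear model such as model_7, coefficients translate into percentage changes in Charges. A sketch of the marginal change calculations, assuming model_7 from above:

```r
b <- coef(model_7)

# Continuous regressors: a one-unit increase changes Charges by roughly 100*beta percent
100 * b["Age"]   # approx. % change in Charges per additional year of age
100 * b["BMI"]   # approx. % change in Charges per additional BMI point

# Dummy regressors: the exact percentage effect is 100*(exp(beta) - 1)
100 * (exp(b["Smoker"]) - 1)      # % higher charges for smokers
100 * (exp(b["childdummy"]) - 1)  # % higher charges for having any children
```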
An eager insurance representative comes back with five potential clients. Using the better of the two models selected above, provide the prediction intervals for the five potential clients using the information provided by the insurance rep.
| Customer | Age | BMI | Female | Children | Smoker | City |
|---|---|---|---|---|---|---|
| 1 | 60 | 22 | 1 | 0 | 0 | Oviedo |
| 2 | 40 | 30 | 0 | 1 | 0 | Sanford |
| 3 | 25 | 25 | 0 | 0 | 1 | Winter Park |
| 4 | 33 | 35 | 1 | 2 | 0 | Winter Springs |
| 5 | 45 | 27 | 1 | 3 | 0 | Oviedo |
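The table above can be coded directly. A sketch using model_7, with column names and dummy coding assumed to match the training data and Sanford as the omitted city category:

```r
newclients <- data.frame(
  Age           = c(60, 40, 25, 33, 45),
  BMI           = c(22, 30, 25, 35, 27),
  Female        = c(1, 0, 0, 1, 1),
  childdummy    = c(0, 1, 0, 1, 1),   # 1 if Children > 0
  Smoker        = c(0, 0, 1, 0, 0),
  WinterSprings = c(0, 0, 0, 1, 0),
  WinterPark    = c(0, 0, 1, 0, 0),
  Oviedo        = c(1, 0, 0, 0, 1)    # customer 2 (Sanford) has all city dummies = 0
)

# predict() returns intervals on the log scale; exponentiate to get dollars
exp(predict(model_7, newdata = newclients, interval = "prediction"))
```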
The owner notices that some of the prediction intervals are wider than others; explain why.
Are there any prediction problems that occur with the five potential clients? If so, explain.